[PECOBLR-201] add variant support #560


Open · @shivam2680 wants to merge 7 commits into main

Conversation


@shivam2680 (Contributor) commented May 19, 2025

Description

This pull request introduces support for detecting and handling VARIANT column types in the Databricks SQL Thrift backend, along with corresponding tests for validation:

  • Updated the _col_to_description and _hive_schema_to_description methods to process metadata for VARIANT types (a usage sketch follows below).
  • Added unit and end-to-end tests to ensure proper functionality.
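As an illustration only (not part of the PR), a minimal usage sketch of what this change surfaces to callers; the connection values are placeholders, and the exact description tuple is an assumption based on the PR's tests:

```python
# Hedged usage sketch: with this change, cursor.description should report
# "variant" as the type code for VARIANT columns.
# The connection values below are placeholders, not part of this PR.
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>",   # placeholder
    http_path="<warehouse-http-path>",    # placeholder
    access_token="<token>",               # placeholder
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT PARSE_JSON('{\"a\": 1}') AS v")
        # Expected with this PR applied: type code "variant" instead of "string"
        print(cursor.description[0][:2])  # ('v', 'variant')
```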

Testing details

End-to-End Tests:

  • Added tests/e2e/test_variant_types.py to validate VARIANT type detection and data retrieval. Includes tests for creating tables with VARIANT columns, inserting records, and verifying correct type handling and JSON parsing. A sketch of this shape follows below.
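A hedged sketch of the round trip these tests validate; the table name, the `cursor` fixture, and the JSON-string return assumption are illustrative, not the PR's actual test code:

```python
import json

def test_variant_round_trip(cursor):  # `cursor` assumed to come from an e2e fixture
    cursor.execute("CREATE TABLE IF NOT EXISTS pysql_variant_demo (v VARIANT)")
    cursor.execute("""INSERT INTO pysql_variant_demo SELECT PARSE_JSON('{"a": 1}')""")
    cursor.execute("SELECT v FROM pysql_variant_demo")

    # Type detection added by this PR: description should carry "variant"
    assert cursor.description[0][1] == "variant"

    # VARIANT values are assumed to arrive as JSON text (Arrow physical type: string)
    (value,) = cursor.fetchone()
    assert json.loads(value)["a"] == 1
```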

Unit Tests:

  • Tests cover scenarios like VARIANT type detection, handling of null or malformed metadata, and fallback behavior for missing Arrow schemas.

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

@shivam2680 changed the title from "add variant support" to "[PECOBLR-201]. add variant support" on May 19, 2025
@shivam2680 changed the title from "[PECOBLR-201]. add variant support" to "[PECOBLR-201] add variant support" on May 19, 2025

```
@@ -692,12 +692,36 @@ def _col_to_description(col):
    else:
        precision, scale = None, None

    # Extract variant type from field if available
```
Contributor
Are you sure this is correct? I tried and was getting metadata as null when the column type is variant. Also for variant the pyarrow schema just shows string in my testing, shouldn't the server return variant type ?

Contributor Author (@shivam2680)
Yes. Debug output:

```
[SHIVAM] field pyarrow.Field<CAST(1 AS VARIANT): string>
[SHIVAM] field metadata {b'Spark:DataType:SqlName': b'VARIANT', b'Spark:DataType:JsonType': b'"variant"'}
```
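For reference, a small pyarrow sketch that reproduces this shape locally (the field name and metadata values mirror the debug output above):

```python
import pyarrow as pa

# The Arrow physical type is plain string; the VARIANT marker lives in
# field-level metadata, which is what the backend change inspects.
field = pa.field(
    "CAST(1 AS VARIANT)",
    pa.string(),
    metadata={
        b"Spark:DataType:SqlName": b"VARIANT",
        b"Spark:DataType:JsonType": b'"variant"',
    },
)
print(field)                                      # pyarrow.Field<CAST(1 AS VARIANT): string>
print(field.metadata[b"Spark:DataType:SqlName"])  # b'VARIANT'
```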

@jprakash-db (Contributor) commented Jun 17, 2025
@shivam2680 I am getting this as the arrow_schema, where metadata is null. Is this some transient behaviour, or am I missing something?

(Screenshot: arrow_schema output showing null metadata, 2025-06-17)


@shivam2680 requested a review from jprakash-db June 17, 2025 15:03

Comment on lines 706 to 715
```python
if field is not None:
    try:
        # Check for variant type in metadata
        if field.metadata and b"Spark:DataType:SqlName" in field.metadata:
            sql_type = field.metadata.get(b"Spark:DataType:SqlName")
            if sql_type == b"VARIANT":
                cleaned_type = "variant"
    except Exception as e:
        logger.debug(f"Could not extract variant type from field: {e}")
```

Contributor
Please check with eng-sqlgateway whether there is a way to get this from Thrift metadata; the Python connector uses Thrift metadata for fetching column metadata.

Contributor
Is there some documentation/contract around this, or is it purely from empirical evidence?

Contributor Author (@shivam2680)
These schema bytes are read from t_result_set_metadata_resp.arrowSchema itself. Please refer to https://sourcegraph.prod.databricks-corp.com/databricks/databricks-sql-python/-/blob/src/databricks/sql/backend/thrift_backend.py?L812
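In other words, a hedged sketch of what the linked code amounts to (the helper name is made up for illustration):

```python
import pyarrow as pa

def fields_with_metadata(arrow_schema_bytes: bytes):
    """Deserialize the IPC-serialized schema carried in
    t_result_set_metadata_resp.arrowSchema and expose per-field metadata."""
    schema = pa.ipc.read_schema(pa.BufferReader(arrow_schema_bytes))
    return {f.name: (f.metadata or {}) for f in schema}

# Hypothetical usage against a Thrift response object `resp`:
# fields_with_metadata(resp.arrowSchema)["v"].get(b"Spark:DataType:SqlName")
```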


@jprakash-db (Contributor) left a comment
Can you please incorporate @pytest.mark.parametrize? I feel a lot of test code duplication can be avoided.

```
@@ -2356,6 +2356,149 @@ def test_execute_command_sets_complex_type_fields_correctly(
            t_execute_statement_req.useArrowNativeTypes.intervalTypesAsArrow
        )

    def test_col_to_description_with_variant_type(self):
```
Contributor
There is too much code duplication; multiple tests just need different arguments. Can you use pytest fixtures with arguments?


@shivam2680 (Contributor Author)

The test_thrift_backend class inherits from unittest.TestCase, which is designed for unittest's test discovery and execution model. When you use @pytest.mark.parametrize with a unittest.TestCase class:

  • Pytest creates separate test instances for each parameter set
  • Unittest's test runner doesn't understand these parametrized arguments
  • Result: TypeError: missing required positional arguments (a sketch of the workarounds follows below)
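For context, a minimal sketch of this constraint and two common workarounds (all names below are hypothetical, not this PR's code): keep unittest.TestCase and loop with subTest, or move the parametrized tests to a plain pytest class.

```python
import unittest

import pytest


def _classify(metadata):
    # Hypothetical stand-in for the backend's metadata check
    if metadata and metadata.get(b"Spark:DataType:SqlName") == b"VARIANT":
        return "variant"
    return "string"


CASES = [
    ({b"Spark:DataType:SqlName": b"VARIANT"}, "variant"),
    ({}, "string"),    # empty metadata falls back
    (None, "string"),  # missing metadata falls back
]


class TestVariantUnittestStyle(unittest.TestCase):
    def test_variant_detection(self):
        # subTest is the unittest-compatible way to get per-case reporting
        for metadata, expected in CASES:
            with self.subTest(metadata=metadata):
                self.assertEqual(_classify(metadata), expected)


class TestVariantPytestStyle:
    # On a plain (non-TestCase) class, parametrize works as the reviewer suggested
    @pytest.mark.parametrize("metadata, expected", CASES)
    def test_variant_detection(self, metadata, expected):
        assert _classify(metadata) == expected
```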

@shivam2680 requested a review from jprakash-db August 20, 2025 20:58